neural network learning and memorization
Review for NeurIPS paper: Neural Networks Learning and Memorization with (almost) no Over-Parameterization
Weaknesses: One of my concerns is the rigor of the paper. A key lemma, namely Lemma 12 in the supplementary material, is given only with a proof sketch. Moreover, the proof sketch is vague about how the authors handle general M-decent activation functions, which makes the results for the ReLU activation function particularly questionable. The significance and novelty of this paper compared with existing results are also not fully demonstrated. It is claimed in this paper that a tight analysis is given of the convergence of the NTK to its expectation.
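For context, a claim of this kind is usually a concentration statement of roughly the following form; the norm, quantifiers, and width requirement below are illustrative placeholders, not the paper's Lemma 12:

% Generic NTK concentration statement (illustrative only; not the paper's lemma).
Let $K_q \in \mathbb{R}^{m \times m}$ denote the empirical NTK Gram matrix of a width-$q$ depth two network at random initialization, and let $K_\infty = \mathbb{E}[K_q]$ be its expectation. Then, with probability at least $1 - \delta$ over the initialization,
\[
  \| K_q - K_\infty \|_{\mathrm{op}} \le \epsilon
  \quad \text{whenever} \quad
  q \gtrsim \mathrm{poly}\!\left(m, \tfrac{1}{\epsilon}, \log\tfrac{1}{\delta}\right).
\]
A "tight analysis" in this sense means the required width $q$ is made as small as possible.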
Review for NeurIPS paper: Neural Networks Learning and Memorization with (almost) no Over-Parameterization
This paper studies optimization in the NTK regime, further improving the best prior width bounds for random data (I believe Oymak-Soltanolkotabi were the prior best). The reviewers and I were all favorable, and I look forward to seeing this paper appear, and support the authors in further investigations. Relatedly, this point was not sufficiently handled in the rebuttal, despite the rebuttal using less than half a page. Please consider such things in the future. One is by Roman Vershynin, and I believe Sebastien Bubeck and colleagues also had a paper on the "Baum" problem.
Neural Networks Learning and Memorization with (almost) no Over-Parameterization
Many results in recent years established polynomial time learnability of various models via neural network algorithms. However, unless the model is linearly separable \cite{brutzkus2018sgd}, or the activation is a polynomial \cite{ge2019mildly}, these results require very large networks -- much larger than what is needed for the mere existence of a good predictor. In this paper we prove that SGD on depth two neural networks can memorize samples, learn polynomials with bounded weights, and learn certain kernel spaces, with {\em near optimal} network size, sample complexity, and runtime. In particular, we show that SGD on a depth two network with $\tilde{O}\left(\frac{m}{d}\right)$ hidden neurons (and hence $\tilde{O}(m)$ parameters) can memorize $m$ random labeled points in $\sphere{d-1}$.
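To make the memorization setting concrete, here is a minimal sketch in Python/NumPy, assuming a standard setup rather than the paper's construction or experiments: the architecture details, initialization, step size, and step count are all placeholder choices. It trains a depth two ReLU network with roughly m/d hidden neurons by one-sample SGD on m random labeled points from the unit sphere.

# Hypothetical illustration (not the paper's construction or experiments):
# memorize m random labeled points on the unit sphere S^{d-1} with a depth two
# ReLU network of width roughly m/d, trained by one-sample SGD on the logistic loss.
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 2000                    # input dimension and number of samples (assumed values)
q = max(1, m // d)                 # width ~ m/d, matching the claimed bound up to log factors

X = rng.standard_normal((m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # random points on the sphere
y = rng.choice([-1.0, 1.0], size=m)               # random binary labels

W = rng.standard_normal((q, d)) / np.sqrt(d)      # trainable hidden-layer weights
v = rng.choice([-1.0, 1.0], size=q) / np.sqrt(q)  # fixed output weights (a common simplification)

lr = 0.5
for step in range(200000):
    i = rng.integers(m)                           # pick one sample
    h = W @ X[i]                                  # pre-activations, shape (q,)
    out = np.maximum(h, 0.0) @ v                  # network output
    # gradient of the logistic loss log(1 + exp(-y*out)) with respect to W
    g = -y[i] / (1.0 + np.exp(y[i] * out))
    W -= lr * g * (v * (h > 0))[:, None] * X[i][None, :]

preds = np.sign(np.maximum(X @ W.T, 0.0) @ v)
print("train accuracy:", (preds == y).mean())     # memorization corresponds to accuracy 1.0

Whether this toy run reaches perfect training accuracy depends on the constants and logarithmic factors hidden in the bound; it is meant only to make the width-versus-sample-count tradeoff concrete.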